Python For Data Science Cheat Sheet
PySpark - RDD Basics
Learn Python for data science Interactively at www.DataCamp.com

Basic Information
>>> rdd.getNumPartitions()                           List the number of partitions
>>> rdd.count()                                      Count RDD instances
3
>>> rdd.countByKey()                                 Count RDD instances by key

Reducing
>>> rdd.reduceByKey(lambda x, y: x + y).collect()    Merge the rdd values for each key
[('a', 9), ('b', 2)]
>>> rdd.reduce(lambda a, b: a + b)                   Merge the rdd values
('a', 7, 'a', 2, 'b', 2)
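To see what the Reducing entries compute without starting Spark, the sketch below mimics them in plain Python. It is an emulation under stated assumptions, not PySpark itself: `reduce_by_key` is a hypothetical helper, `pairs` stands in for the sheet's `rdd`, and the result is sorted by key for readability, whereas Spark makes no ordering promise.

```python
from functools import reduce

# Same sample data the sheet uses for `rdd` (assumed stand-in, a plain list).
pairs = [('a', 7), ('a', 2), ('b', 2)]

def reduce_by_key(pairs, func):
    """Hypothetical plain-Python stand-in for rdd.reduceByKey(func).collect()."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are merged with func.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

by_key = reduce_by_key(pairs, lambda x, y: x + y)

# reduce() on the whole records: tuples concatenate under +,
# which is why the sheet shows one long flat tuple.
flat = reduce(lambda a, b: a + b, pairs)

print(by_key)
print(flat)
```

Note that `reduce` works on whole records while `reduceByKey` only ever sees the values, which explains the two very different result shapes.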


Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.

Retrieving RDD Information
>>> rdd.countByValue()                  Count RDD instances by value
defaultdict(<type 'int'>, {('b', 2): 1, ('a', 2): 1, ('a', 7): 1})
>>> rdd.collectAsMap()                  Return (key,value) pairs as a dictionary
{'a': 2, 'b': 2}
>>> rdd3.sum()                          Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()        Check whether RDD is empty
True

Reshaping Data

Grouping by
>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect()    Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect()                 Group rdd by key
[('a', [7, 2]), ('b', [2])]
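The grouping and counting entries can be checked against a plain-Python sketch. This is an emulation, not Spark: `group_by` is a hypothetical helper, `pairs` stands in for the sheet's `rdd`, and groups are sorted by key only to make the result predictable.

```python
from collections import Counter, defaultdict

# Stand-in for the sheet's `rdd` (assumption: a plain list instead of an RDD).
pairs = [('a', 7), ('a', 2), ('b', 2)]

def group_by(items, key_func):
    """Hypothetical stand-in for rdd.groupBy(key_func).mapValues(list).collect()."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return sorted(groups.items())

# groupBy(lambda x: x % 2) on range(100): even numbers under 0, odd under 1.
parity = group_by(range(100), lambda x: x % 2)

# groupByKey keeps only the values of each (key, value) pair.
by_key = [(k, [v for _, v in grp]) for k, grp in group_by(pairs, lambda kv: kv[0])]

# countByValue counts whole records; collectAsMap keeps the last value per key.
counts = Counter(pairs)
as_map = dict(pairs)

print(by_key)
print(as_map)
```

The `as_map` line shows why `collectAsMap()` returns `{'a': 2, 'b': 2}` on the sheet: the earlier `('a', 7)` is overwritten by the later `('a', 2)`.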
Initializing Spark

SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')

Inspect SparkContext
>>> sc.version                  Retrieve SparkContext version
>>> sc.pythonVer                Retrieve Python version
>>> sc.master                   Master URL to connect to
>>> str(sc.sparkHome)           Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())         Retrieve name of the Spark User running SparkContext
>>> sc.appName                  Return application name
>>> sc.applicationId            Retrieve application ID
>>> sc.defaultMinPartitions     Default minimum number of partitions for RDDs

Summary
>>> rdd3.max()                  Maximum value of RDD elements
99
>>> rdd3.min()                  Minimum value of RDD elements
0
>>> rdd3.mean()                 Mean value of RDD elements
49.5
>>> rdd3.stdev()                Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()             Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)           Compute histogram by bins
([0, 33, 66, 99], [33, 33, 34])
>>> rdd3.stats()                Summary statistics (count, mean, stdev, max & min)

Aggregating
>>> from operator import add
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0), seqOp, combOp)                  Aggregate RDD elements of each partition and then the results
(4950, 100)
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect()    Aggregate values of each RDD key
[('a', (9, 2)), ('b', (2, 1))]
>>> rdd3.fold(0, add)                                     Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add).collect()                       Merge the values for each key
[('a', 9), ('b', 2)]
>>> rdd3.keyBy(lambda x: x+x).collect()                   Create tuples of RDD elements by applying a function
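The two-step behaviour of `aggregate` (fold each partition with `seqOp`, then merge the partial results with `combOp`) is easier to follow in a plain-Python sketch. This is an emulation and not Spark's implementation: `split` and `aggregate` are hypothetical helpers, and the two-way split is an assumption that mirrors the `local[2]` master used on the sheet.

```python
from functools import reduce
import statistics

def split(data, n):
    """Hypothetical helper: cut data into n roughly equal 'partitions'."""
    data = list(data)
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def aggregate(partitions, zero, seq_op, comb_op):
    """Emulates rdd.aggregate(zero, seq_op, comb_op) on a list of lists."""
    # Step 1: fold every partition on its own, starting from the zero value.
    per_partition = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge the per-partition results, again starting from zero.
    return reduce(comb_op, per_partition, zero)

# Same operators as on the sheet: the accumulator is a (sum, count) pair.
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

partitions = split(range(100), 2)
total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count

# stdev() and variance() on an RDD are population statistics,
# so they line up with pstdev/pvariance rather than stdev/variance.
pstdev = statistics.pstdev(range(100))
pvar = statistics.pvariance(range(100))

print((total, count), mean, pvar)
```

Because the zero value is applied once per partition and once more in the final merge, it should be neutral for both operators, as `(0, 0)` is here.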
Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
...         .setMaster("local")
...         .setAppName("My app")
...         .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)

Applying Functions
>>> rdd.map(lambda x: x+(x[1],x[0])).collect()         Apply a function to each RDD element
[('a', 7, 7, 'a'), ('a', 2, 2, 'a'), ('b', 2, 2, 'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0]))        Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a', 7, 7, 'a', 'a', 2, 2, 'a', 'b', 2, 2, 'b']
>>> rdd4.flatMapValues(lambda x: x).collect()          Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

Mathematical Operations
>>> rdd.subtract(rdd2).collect()                       Return each rdd value not contained in rdd2
[('b', 2), ('a', 7)]
>>> rdd2.subtractByKey(rdd).collect()                  Return each (key,value) pair of rdd2 with no matching key in rdd
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()                      Return the Cartesian product of rdd and rdd2
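The difference between `map`, `flatMap` and `flatMapValues`, and between `subtractByKey` and `cartesian`, can be sketched with plain lists. This is an emulation rather than Spark code; the lists stand in for the sheet's RDDs, and the contents of `rdd2` are an assumption inferred from the outputs the sheet shows.

```python
from itertools import chain, product

# Plain-list stand-ins for the sheet's RDDs.
rdd = [('a', 7), ('a', 2), ('b', 2)]
rdd2 = [('a', 2), ('d', 1), ('b', 1)]  # assumed, inferred from the sheet's outputs
rdd4 = [('a', ['x', 'y', 'z']), ('b', ['p', 'r'])]

f = lambda x: x + (x[1], x[0])

# map: exactly one output record per input record.
mapped = [f(x) for x in rdd]

# flatMap: apply f, then flatten each result into the output stream.
flat_mapped = list(chain.from_iterable(f(x) for x in rdd))

# flatMapValues: flatten only the value side and repeat the key.
flat_values = [(k, v) for k, vs in rdd4 for v in vs]

# subtractByKey: keep pairs of rdd2 whose key never appears in rdd.
keys_in_rdd = {k for k, _ in rdd}
subtracted = [kv for kv in rdd2 if kv[0] not in keys_in_rdd]

# cartesian: every record of rdd paired with every record of rdd2.
cart = list(product(rdd, rdd2))

print(mapped)
print(flat_mapped)
print(flat_values)
print(subtracted, len(cart))
```

The Cartesian product grows as the product of the two sizes (3 x 3 = 9 pairs here), so on real data it is best reserved for small RDDs.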


Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

Selecting Data

Getting
>>> rdd.collect()                             Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2)                               Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first()                               Take first RDD element
('a', 7)
>>> rdd.top(2)                                Take top 2 RDD elements
[('b', 2), ('a', 7)]

Sampling
>>> rdd3.sample(False, 0.15, 81).collect()    Return sampled subset of rdd3
[3, 4, 27, 31, 40, 41, 42, 43, 60, 76, 79, 80, 86, 97]

Sort
>>> rdd2.sortBy(lambda x: x[1]).collect()     Sort RDD by given function
[('d', 1), ('b', 1), ('a', 2)]
>>> rdd2.sortByKey().collect()                Sort (key, value) RDD by key
[('a', 2), ('b', 1), ('d', 1)]

Repartitioning
>>> rdd.repartition(4)                        New RDD with 4 partitions
>>> rdd.coalesce(1)                           Decrease the number of partitions in the RDD to 1
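The Getting and Sort entries differ mainly in whether they respect the original record order or the natural ordering of the records. A plain-Python sketch, assuming lists in place of RDDs and an `rdd2` whose contents are inferred from the sheet's outputs, makes the distinction visible:

```python
import heapq

# Plain-list stand-ins for the sheet's RDDs.
rdd = [('a', 7), ('a', 2), ('b', 2)]
rdd2 = [('a', 2), ('d', 1), ('b', 1)]  # assumed, inferred from the sheet's outputs

# take(n) and first() follow the order in which records are stored.
first_two = rdd[:2]
first = rdd[0]

# top(n) returns the n largest records in descending order;
# tuples compare element by element, so ('b', 2) outranks ('a', 7).
top_two = heapq.nlargest(2, rdd)

# sortBy with a key function versus sortByKey on the first element.
sorted_by_value = sorted(rdd2, key=lambda x: x[1])
sorted_by_key = sorted(rdd2, key=lambda x: x[0])

print(first_two, first)
print(top_two)
print(sorted_by_value)
print(sorted_by_key)
```

Python's `sorted` is stable, so the two records with value 1 keep their original relative order here; that tie order is a property of this sketch, not something to rely on in a distributed sort.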
Loading Data

Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a", ["x","y","z"]),
...                        ("b", ["p", "r"])])

External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Filtering
>>> rdd.filter(lambda x: "a" in x).collect()    Filter the RDD
[('a', 7), ('a', 2)]
>>> rdd5.distinct().collect()                   Return distinct RDD values
['a', 2, 'b', 7]
>>> rdd.keys().collect()                        Return (key,value) RDD's keys
['a', 'a', 'b']

Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)                              Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)

Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
...                      'org.apache.hadoop.mapred.TextOutputFormat')

Stopping SparkContext
>>> sc.stop()
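The practical difference between textFile() and wholeTextFiles() is the shape of the records: one record per line versus one (path, content) pair per file. The sketch below imitates that on the local file system with the standard library only. It is an assumption-laden illustration, not Spark: `text_file` and `whole_text_files` are hypothetical helpers, and the two sample files are made up for the demonstration.

```python
import glob
import os
import tempfile

def text_file(pattern):
    """Hypothetical stand-in for sc.textFile(pattern).collect(): one record per line."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            lines.extend(line.rstrip("\n") for line in fh)
    return lines

def whole_text_files(directory):
    """Hypothetical stand-in for sc.wholeTextFiles(directory).collect():
    one (path, full content) pair per file."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        with open(path) as fh:
            records.append((path, fh.read()))
    return records

# Two made-up sample files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "one\ntwo\n"), ("b.txt", "three\n")]:
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(body)
    lines = text_file(os.path.join(tmp, "*.txt"))
    files = whole_text_files(tmp)

print(lines)
print([(os.path.basename(p), c) for p, c in files])
```

Line-oriented records suit per-line parsing, while whole-file records are the better fit when a file must be parsed as a unit.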


